Basketball Analytics with R

Mathew Chandy

History of the NBA Box Score

  • 1946-1947: PTS, AST, FG/FGM, FGA, FG%, FTM, FTA, FT%, PF

  • 1950-1951: TRB/REB

  • 1951-1952: MP

  • 1973-1974: ORB/OREB, DRB/DREB, STL, BLK

  • 1977-1978: TOV

  • 1979-1980: 2P/2PM, 2PA, 3P/3PM, 3PA

  • 1996-1997: Shot distance tracking is introduced.

Four Factors

Offense (subscript \(T\) denotes the team, \(O\) its opponent):

\(eFG\% = \frac{ (2PM)_T + 1.5 \times (3PM)_T }{ (2PA)_T + (3PA)_T}\)

\(TO = \frac{TOV_T}{POSS_T}\)

\(REB\% = \frac{OREB_T}{OREB_T + DREB_O}\)

\(FT\) Rate \(= \frac{FTM_T}{(2PA)_T + (3PA)_T}\)

The Four Factors, from Kubatko, J., Oliver, D., Pelton, K., and Rosenbaum, D. T. (2007). A starting point for analyzing basketball statistics. Journal of Quantitative Analysis in Sports, 3(3):1–22.

Four Factors

Defense:

\(eFG\% = \frac{ (2PM)_O + 1.5 \times (3PM)_O }{ (2PA)_O + (3PA)_O}\)

\(TO = \frac{TOV_O}{POSS_O}\)

\(REB\% = \frac{DREB_T}{OREB_O + DREB_T}\)

\(FT\) Rate \(= \frac{FTM_O}{(2PA)_O + (3PA)_O}\)
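The offensive and defensive formulas can be sketched in base R as follows. This is only an illustration: the box-score totals are made up, the list field names (`P2M`, `P2A`, `POSS`, and so on) and the two helper functions are hypothetical, and possessions are supplied directly rather than estimated.

```r
# Hypothetical single-game totals for a team (T) and its opponent (O).
# All numbers are invented for illustration.
team <- list(P2M = 30, P2A = 55, P3M = 12, P3A = 33, FTM = 18,
             TOV = 13, OREB = 10, DREB = 34, POSS = 100)
opp  <- list(P2M = 28, P2A = 58, P3M = 10, P3A = 30, FTM = 15,
             TOV = 15, OREB = 9, DREB = 32, POSS = 100)

# Offensive Four Factors: what the team does with its own possessions.
four_factors_offense <- function(t, o) {
  c(eFG = (t$P2M + 1.5 * t$P3M) / (t$P2A + t$P3A),
    TO  = t$TOV / t$POSS,
    REB = t$OREB / (t$OREB + o$DREB),
    FT  = t$FTM / (t$P2A + t$P3A))
}

# Defensive Four Factors: what the team allows its opponent to do.
four_factors_defense <- function(t, o) {
  c(eFG = (o$P2M + 1.5 * o$P3M) / (o$P2A + o$P3A),
    TO  = o$TOV / o$POSS,
    REB = t$DREB / (o$OREB + t$DREB),
    FT  = o$FTM / (o$P2A + o$P3A))
}

off <- four_factors_offense(team, opp)
def <- four_factors_defense(team, opp)
round(rbind(offense = off, defense = def), 3)
```

Note that the defensive rebounding factor is not obtained by swapping the two teams: it is the share of the opponent's misses that the team itself rebounds.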

Acquiring Data

Basketball Reference

Loading Data through Basketball Reference

Tidyverse

install.packages("tidyverse", repos = "http://cran.us.r-project.org")

Package for Acquiring Data

if (!requireNamespace('devtools', quietly = TRUE)){
  install.packages('devtools')
}
devtools::install_github("sportsdataverse/sportsdataverse-R")

Package for Visualizing and Analyzing Data

devtools::install_github("sndmrc/BasketballAnalyzeR")

Four Factors

Shot Chart

Assist Network

Expected Points

Expected value: let \(y_1, \dots, y_n\) be all the possible values of a discrete random variable \(Y\).

\(E(Y) = \sum_{i = 1}^n y_i P(Y = y_i)\)

Expected Points

Let \(Y\) be the points resulting from a 2-point shot attempt. What are the possible values \(y_i\) of \(Y\)? \(\{0, 2\}\)
In this context, what is \(P(Y = 2)\)? The shooter's 2-point field goal percentage, so \(P(Y = 0)\) is one minus that percentage.
Jayson Tatum’s 2PFG% this season is 54%. What are his expected points on a 2-point field goal attempt? \(0 \times 0.46 + 2 \times 0.54 = 1.08\)
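The same calculation can be written directly from the definition of expected value. A minimal sketch in R; the variable names are illustrative only.

```r
# Possible outcomes of a 2-point attempt and their probabilities,
# using the 54% two-point percentage quoted above.
outcomes <- c(0, 2)
p_make   <- 0.54
probs    <- c(1 - p_make, p_make)

# E(Y) = sum of y_i * P(Y = y_i)
expected_points <- sum(outcomes * probs)
expected_points
```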

Expected Points

Conditional expected value: let \(X\) be some variable that changes the probability of \(y_i\).

\(E(Y | X) = \sum_{i = 1}^n y_i P(Y = y_i | X)\).
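As an illustration, \(X\) could be the zone a shot is taken from. The sketch below computes \(E(Y | X)\) for three zones; the zone names and make probabilities are invented for the example and are not real player data.

```r
# Hypothetical make probabilities by shot zone (not real data).
shots <- data.frame(
  zone   = c("at rim", "mid-range", "three"),
  value  = c(2, 2, 3),          # points if the shot goes in
  p_make = c(0.65, 0.40, 0.36)  # assumed P(make | zone)
)

# For each zone the outcomes are {0, value}, so
# E(Y | zone) = 0 * (1 - p_make) + value * p_make.
shots$expected_points <- shots$value * shots$p_make
shots
```

Under these assumed percentages, a mid-range attempt is worth fewer expected points than a three-point attempt even though it is made more often.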

Cluster Analysis

March Madness

How many possible March Madness bracket outcomes are there?

Hint: how many games are there in the tournament, not including the First Four, and how many possible outcomes are there for each game?

Answer: there are 63 games with 2 possible outcomes each, so there are \(2^{63} = 9,223,372,036,854,775,808\) possible brackets.
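The count can be checked in R. This sketch assumes the 64-team bracket described in the hint (First Four excluded); the variable names are illustrative.

```r
# Games played in each round of a 64-team single-elimination bracket.
games_per_round <- c(32, 16, 8, 4, 2, 1)
n_games <- sum(games_per_round)

# Each game has two possible winners, so the number of brackets is 2^n_games.
# A power of two this size is still represented exactly as a double.
n_brackets <- 2^n_games

n_games
format(n_brackets, scientific = FALSE, big.mark = ",")
```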

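Under the standard scoring of a bracket, each correct pick is worth:

  • Round One: 1
  • Round Two: 2
  • Sweet Sixteen: 4
  • Elite Eight: 8
  • Final Four: 16
  • Championship: 32

A perfect bracket gets a score of 192, since each of the six rounds is worth 32 points in total.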
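The perfect score follows from multiplying the number of games in each round by the points per correct pick. A small sketch, where `score_bracket` is a hypothetical helper that scores a bracket from the number of correct picks per round:

```r
# Games and points per correct pick for the six rounds, in order.
games_per_round <- c(32, 16, 8, 4, 2, 1)
points_per_pick <- c(1, 2, 4, 8, 16, 32)

# Every round is worth the same total number of points.
round_totals  <- games_per_round * points_per_pick
perfect_score <- sum(round_totals)

# Hypothetical helper: score a bracket given correct picks per round.
score_bracket <- function(correct_picks) {
  stopifnot(all(correct_picks <= games_per_round))
  sum(correct_picks * points_per_pick)
}

perfect_score
score_bracket(c(24, 10, 4, 2, 1, 0))
```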

Ranking by some metric

This could be as simple as using AP rankings, or you could develop your own metric. You can evaluate your metric based on how it performs on past tournaments.

Ranking by some metric

  • We can simply pick the higher-ranked team at each stage
  • A strict hierarchy is unrealistic, and a single prediction for a whole tournament carries a lot of uncertainty
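Picking the higher-ranked team at each stage can be sketched as below. The eight team names, their ratings, and the `pick_by_metric` helper are all hypothetical; teams are listed in bracket order, so adjacent pairs play each other.

```r
# Advance the team with the higher metric in every game until one remains.
# `teams` is in bracket order; `metric` is a named numeric vector.
pick_by_metric <- function(teams, metric) {
  rounds <- list()
  while (length(teams) > 1) {
    a <- teams[seq(1, length(teams), by = 2)]
    b <- teams[seq(2, length(teams), by = 2)]
    teams <- unname(ifelse(metric[a] >= metric[b], a, b))
    rounds[[length(rounds) + 1]] <- teams
  }
  rounds
}

# Hypothetical ratings for an 8-team bracket.
teams  <- c("A", "B", "C", "D", "E", "F", "G", "H")
metric <- c(A = 95, B = 70, C = 82, D = 85, E = 90, F = 60, G = 75, H = 88)

rounds <- pick_by_metric(teams, metric)
rounds
```

Because the metric never changes, this approach always produces the same bracket and can never pick an upset.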

How do we model how good a team is?

We can predict the probability of a team winning a certain March Madness game.

What are some models that can be used for binary classification? Logistic Regression, Decision Trees, Random Forest, SVM, Neural Network, etc.

What are some possible features?

Strength of Schedule, Performance in Recent Games, Performance in Recent Seasons, Injuries/Suspensions, Location, Player Matchups, Offensive/Defensive Tendencies
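As one example of such a model, the sketch below fits a logistic regression with `glm()` on simulated games. The two features (`rating_diff`, `near_home`) and the coefficients used to generate the data are assumptions made purely for illustration, not estimates from real tournaments.

```r
set.seed(42)

# Simulated training data: one row per game from one team's point of view.
n <- 500
games <- data.frame(
  rating_diff = rnorm(n, mean = 0, sd = 8),  # hypothetical rating gap
  near_home   = rbinom(n, 1, 0.5)            # hypothetical location flag
)
true_logit <- 0.15 * games$rating_diff + 0.3 * games$near_home
games$win  <- rbinom(n, 1, plogis(true_logit))

# Logistic regression: models log-odds of winning as a linear function.
fit <- glm(win ~ rating_diff + near_home, data = games, family = binomial)

# Predicted win probability for a new matchup.
new_game <- data.frame(rating_diff = 5, near_home = 0)
p_win <- predict(fit, newdata = new_game, type = "response")
p_win
```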

“Probabilistic” approach

Depending on the choice of model, the team most likely to advance at one stage may be less likely than another team to advance at a later stage.

  • Can account for differences in playstyles
  • Not practical to compute (9,223,372,036,854,775,808 different possibilities to consider)
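For a tiny bracket the enumeration is still feasible, which makes the idea concrete. The sketch below lists all \(2^3 = 8\) outcomes of a hypothetical 4-team bracket (A plays B, C plays D, winners meet in the final). The win-probability matrix `p` is invented for illustration.

```r
teams <- c("A", "B", "C", "D")

# p[i, j] = assumed probability that team i beats team j (hypothetical).
p <- matrix(0.5, 4, 4, dimnames = list(teams, teams))
p["A", "B"] <- 0.70; p["B", "A"] <- 0.30
p["C", "D"] <- 0.60; p["D", "C"] <- 0.40
p["A", "C"] <- 0.45; p["C", "A"] <- 0.55
p["A", "D"] <- 0.80; p["D", "A"] <- 0.20
p["B", "C"] <- 0.40; p["C", "B"] <- 0.60
p["B", "D"] <- 0.50; p["D", "B"] <- 0.50

# Every combination of semifinal winners and which finalist wins the title.
brackets <- expand.grid(semi1 = c("A", "B"), semi2 = c("C", "D"),
                        champ_from = c(1, 2), stringsAsFactors = FALSE)
loser1 <- ifelse(brackets$semi1 == "A", "B", "A")
loser2 <- ifelse(brackets$semi2 == "C", "D", "C")
brackets$champion <- ifelse(brackets$champ_from == 1,
                            brackets$semi1, brackets$semi2)
runner_up <- ifelse(brackets$champ_from == 1,
                    brackets$semi2, brackets$semi1)

# Probability of a bracket = product of its three game probabilities.
brackets$prob <- p[cbind(brackets$semi1, loser1)] *
  p[cbind(brackets$semi2, loser2)] *
  p[cbind(brackets$champion, runner_up)]

title_prob <- tapply(brackets$prob, brackets$champion, sum)
brackets[order(-brackets$prob), ]
title_prob
```

In this made-up example A is a heavy favorite in its semifinal but an underdog against C, which is the kind of matchup effect a strict ranking cannot express. With 63 games instead of 3, the same table would need \(2^{63}\) rows.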

How do we model how good a team is?

We can predict how many points a team will score in a March Madness game.

What kind of distribution can we use to model a count variable? Poisson or Binomial. Poisson is easier to work with because it has only one parameter.

Linear regression assumes a response with support \((-\infty, \infty)\), but an expected count must be positive, so we use a link function (the log, for Poisson regression) to map the expected count onto the real line.

Poisson Regression

For regression, let \(Y\) be the number of points scored by the team of interest, and let \(x_j\) be the \(j\)th predictor out of \(n\).

Then \(\log(E(Y | x)) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dots + \theta_nx_n\)
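In R, a model of this form can be fit with `glm()` and a Poisson family. The sketch below uses simulated data; the predictors `off_rating` and `opp_def_rating` and the coefficients used to simulate points are assumptions for illustration only.

```r
set.seed(7)

# Simulated games: standardized ratings for the team and its opponent.
n <- 400
games <- data.frame(
  off_rating     = rnorm(n),  # hypothetical offensive rating of the team
  opp_def_rating = rnorm(n)   # hypothetical defensive rating of the opponent
)
games$points <- rpois(
  n, exp(4.25 + 0.08 * games$off_rating - 0.06 * games$opp_def_rating)
)

# Poisson regression with the log link: log(E(Y | x)) is linear in x.
fit <- glm(points ~ off_rating + opp_def_rating,
           data = games, family = poisson(link = "log"))

# Predicted mean points (lambda) for a new matchup.
new_game   <- data.frame(off_rating = 1, opp_def_rating = -0.5)
lambda_hat <- predict(fit, newdata = new_game, type = "response")
lambda_hat
```

Using `type = "response"` returns the prediction on the points scale, while `type = "link"` returns it on the log scale.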

Simulation approach

For each team in a game, we can draw from Pois\((\lambda)\), where \(\lambda\) is the predicted response from our regression for that team’s points. The team that scores more points advances. We can simulate a tournament as many times as we want. Then we can get an idea of how likely a team is to make it to a certain round. Note that the most likely bracket may not coincide with the most likely winner.
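A minimal sketch of this simulation for a 4-team bracket follows. It makes several simplifying assumptions that are not in the slides: each team has a single fixed \(\lambda\) regardless of opponent, the values of \(\lambda\) are invented, and tied scores are simply redrawn. The helper names are hypothetical.

```r
set.seed(1)

# Simulate one game; returns TRUE if the first team wins.
# Ties are redrawn (a stand-in for overtime).
simulate_game <- function(lambda_a, lambda_b) {
  repeat {
    pts <- rpois(2, c(lambda_a, lambda_b))
    if (pts[1] != pts[2]) return(pts[1] > pts[2])
  }
}

# Simulate a whole bracket; names(lambda) gives teams in bracket order.
simulate_bracket <- function(lambda) {
  teams <- names(lambda)
  while (length(teams) > 1) {
    winners <- character(0)
    for (i in seq(1, length(teams), by = 2)) {
      a <- teams[i]
      b <- teams[i + 1]
      a_wins  <- simulate_game(lambda[[a]], lambda[[b]])
      winners <- c(winners, if (a_wins) a else b)
    }
    teams <- winners
  }
  teams
}

# Hypothetical predicted points per game for each team.
lambda <- c(A = 78, B = 70, C = 74, D = 66)

# Repeat the tournament many times and estimate title probabilities.
champions  <- replicate(2000, simulate_bracket(lambda))
title_prob <- table(champions) / length(champions)
title_prob
```

Tracking the winners of each round, not just the champion, would give the estimated probability of every team reaching every round.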

Simulation approach

  • Requires far less computation than enumerating every possible bracket
  • Makes it hard to identify the single optimal bracket

Vibes Bracket

Pick the teams you think will win, or the teams you personally want to win. It worked great for UConn students! Just make sure you don’t pick too few or too many upsets.

Picking Upsets

NCAA Average Number of Upsets:

  • Total Upsets: 8.5
  • First Round: 4.65
  • Second Round: 3.13
  • Elite Eight: 0.31
  • Final Four: 0.10

The End